Burke County
Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting
Sharma, Harshita, Reynolds, Maxwell C., Salvatelli, Valentina, Sykes, Anne-Marie G., Horst, Kelly K., Schwaighofer, Anton, Ilse, Maximilian, Melnichenko, Olesya, Bond-Taylor, Sam, Pérez-García, Fernando, Mugu, Vamshi K., Chan, Alex, Colak, Ceylan, Swartz, Shelby A., Nashawaty, Motassem B., Gonzalez, Austin J., Ouellette, Heather A., Erdal, Selnur B., Schueler, Beth A., Wetscherek, Maria T., Codella, Noel, Jain, Mohit, Bannur, Shruthi, Bouzid, Kenza, Castro, Daniel C., Hyland, Stephanie, Korfiatis, Panos, Khandelwal, Ashish, Alvarez-Valle, Javier
AI-assisted report generation offers the opportunity to reduce radiologists' workload stemming from expanded screening guidelines, complex cases and workforce shortages, while maintaining diagnostic accuracy. In addition to describing pathological findings in chest X-ray reports, interpreting lines and tubes (L&T) is demanding and repetitive for radiologists, especially with high patient volumes. We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray (CXR) report generation, that encompasses both clinical findings and L&T reporting. Developed using a large-scale, multi-site, longitudinal dataset of 3.1 million studies (comprising 6 million images from 806k patients) from Mayo Clinic, MAIRA-X was evaluated on three holdout datasets and the public MIMIC-CXR dataset, where it significantly improved AI-generated reports over the state of the art on lexical quality, clinical correctness, and L&T-related elements. A novel L&T-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change and placement. A first-of-its-kind retrospective user evaluation study was conducted with nine radiologists of varying experience, who blindly reviewed 600 studies from distinct subjects. The user study found comparable rates of critical errors (3.0% for original vs. 4.6% for AI-generated reports) and a similar rate of acceptable sentences (97.8% for original vs. 97.4% for AI-generated reports), marking a significant improvement over prior user studies with larger gaps and higher error rates. Our results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > North Dakota > Burke County (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
- Information Technology > Sensing and Signal Processing > Image Processing (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
AI-Mediated Communication Reshapes Social Structure in Opinion-Diverse Groups
Huq, Faria, Claggett, Elijah L., Shirado, Hirokazu
Group segregation or cohesion can emerge from micro-level communication, and AI-assisted messaging may shape this process. Here, we report a preregistered online experiment (N = 557 across 60 sessions) in which participants discussed controversial political topics over multiple rounds and could freely change groups. Some participants received real-time message suggestions from a large language model (LLM), either personalized to their stance ("individual assistance") or incorporating their group members' perspectives ("relational assistance"). We find that small variations in AI-mediated communication cascade into macro-level differences in group composition. Participants with individual assistance send more messages and show greater stance-based clustering, whereas those with relational assistance use more receptive language and form more heterogeneous ties. Hybrid expressive processes--jointly produced by humans and AI--can reshape collective organization. The patterns of structural division and cohesion depend on how AI incorporates users' interaction context. Understanding how micro-level communication patterns accumulate into macro-level group segregation or cohesion is a central question in social and behavioral science [1-3]. Conversations across differences are often asymmetric: people find it difficult to engage constructively with those who hold opposing views [4, 5], and stereotypes bias perceptions of outgroup members [6]. Online platforms can intensify these dynamics through lowered inhibitions [9], emotion-amplified diffusion [10], and algorithmic or behavioral clustering processes [11-13]. While the forces that produce social division are well theorized and empirically documented, far less is known about the micro-level conversational mechanisms that can instead generate cohesion in ideologically diverse groups [14-16].
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Law (0.68)
- Government > Regional Government (0.47)
- Government > Immigration & Customs (0.46)
OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit
We present OnPrem.LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. OnPrem.LLM supports multiple LLM backends -- including llama.cpp, Ollama, vLLM, and Hugging Face Transformers -- with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, OnPrem.LLM also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.
- North America > United States > Virginia > Alexandria County > Alexandria (0.04)
- North America > United States > North Dakota > Burke County (0.04)
Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks
Frank, Kevin, Gulati, Anmol, Lumer, Elias, Campagna, Sindy, Subbiah, Vamse Kumar
Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.
- North America > United States > North Dakota > Burke County (0.05)
- North America > United States > Pennsylvania (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- (2 more...)
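The execution-based scoring mentioned in the Jackal abstract can be illustrated with a minimal sketch. It assumes that execution accuracy means the predicted query returns the same set of issues as the validated query, and that the overall score is the unweighted mean over request types; the function names, the record layout, and the `PROJ-n` issue keys are hypothetical and are not taken from the Jackal toolkit.

```python
from collections import defaultdict


def execution_accuracy_by_type(records):
    """Per-request-type execution accuracy.

    Each record is (request_type, predicted_issue_keys, gold_issue_keys).
    Assumption: a prediction counts as correct when both queries return
    the same set of issue keys, ignoring result order.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for request_type, predicted, gold in records:
        totals[request_type] += 1
        if set(predicted) == set(gold):
            hits[request_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


def macro_average(per_type):
    """Average the per-type scores with equal weight per request type."""
    return sum(per_type.values()) / len(per_type)


# Toy data with made-up issue keys, for illustration only.
records = [
    ("long_nl", ["PROJ-1", "PROJ-2"], ["PROJ-2", "PROJ-1"]),  # same set
    ("long_nl", ["PROJ-3"], ["PROJ-3"]),                      # same set
    ("short_nl", ["PROJ-1"], ["PROJ-1", "PROJ-4"]),           # missing an issue
    ("short_nl", ["PROJ-5"], ["PROJ-5"]),                     # same set
]

per_type = execution_accuracy_by_type(records)
overall = macro_average(per_type)
print(per_type, overall)
```

Comparing result sets rather than query strings means two differently written queries score as equal when they retrieve the same issues, which is the usual motivation for execution-based metrics over exact match.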
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Yang, Sihan, Xu, Runsen, Xie, Yiman, Yang, Sizhe, Li, Mo, Lin, Jingli, Zhu, Chenming, Chen, Xiaochen, Duan, Haodong, Yue, Xiangyu, Lin, Dahua, Wang, Tai, Pang, Jiangmiao
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > North Dakota > Burke County (0.04)
- (2 more...)
- Information Technology (0.46)
- Law (0.46)
Data Scaling Laws for Radiology Foundation Models
Ilse, Maximilian, Sharma, Harshita, Schwaighofer, Anton, Bond-Taylor, Sam, Pérez-García, Fernando, Melnichenko, Olesya, Sykes, Anne-Marie G., Horst, Kelly K., Khandelwal, Ashish, Reynolds, Maxwell, Wetscherek, Maria T., Codella, Noel C. F., Alvarez-Valle, Javier, Panagiotis, Korfiatis, Salvatelli, Valentina
Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO, which represent the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model's ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data.
- North America > United States > North Dakota > Burke County (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
The AI Memory Gap: Users Misremember What They Created With AI or Without
Zindulka, Tim, Goller, Sven, Fernandes, Daniela, Welsch, Robin, Buschek, Daniel
As large language models (LLMs) become embedded in interactive text generation, disclosure of AI as a source depends on people remembering which ideas or texts came from themselves and which were created with AI. We investigate how accurately people remember the source of content when using AI. In a pre-registered experiment, 184 participants generated and elaborated on ideas both unaided and with an LLM-based chatbot. One week later, they were asked to identify the source (noAI vs withAI) of these ideas and texts. Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI. We validated our results using a computational model of source memory. Discussing broader implications, we highlight the importance of considering source confusion in the design and use of interactive text generation technologies.
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine (0.68)
- Education > Educational Setting > Higher Education (0.46)
Crop Pest Classification Using Deep Learning Techniques: A Review
Ejaz, Muhammad Hassam, Bilal, Muhammad, Habib, Usman, Attique, Muhammad, Chung, Tae-Sun
Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. Early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.
- North America > United States > Texas > Ellis County (0.14)
- North America > United States > Colorado (0.04)
- North America > United States > North Dakota > Burke County (0.04)
- (7 more...)
- Overview (1.00)
- Research Report > Experimental Study (0.67)
MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them
Zhang, Weichen, Sun, Yiyou, Huang, Pohao, Pu, Jiayue, Lin, Heyue, Song, Dawn
Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench--Measuring Illusions in Risky AGEnt settings--the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. For analysis, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in a deterministic and reproducible manner. To evaluate hallucination behaviors, we adopt a fine-grained LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. MIRAGE-Bench provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.
- North America > United States > Montana > Roosevelt County (0.04)
- North America > United States > North Dakota > Burke County (0.04)
- North America > United States > Louisiana (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.67)
Rapid Modeling Architecture for Lightweight Simulator to Accelerate and Improve Decision Making for Industrial Systems
Designing industrial systems, such as building, improving, and automating distribution centers and manufacturing plants, involves critical decision-making with limited information in the early phases. The lack of information leads to less accurate system designs, and the resulting problems are often difficult to resolve later. Using simulators to model the designed system is an effective way to identify issues early. However, the modeling time required by conventional simulators is too long to allow the rapid model creation that decision-making demands. In this paper, we propose a Rapid Modeling Architecture (RMA) for a lightweight industrial simulator that mitigates the modeling burden while maintaining the essential details in order to accelerate and improve decision-making. We have prototyped a simulator based on the RMA and applied it to an actual factory layout design problem. We also compared the modeling time of our simulator with that of an existing simulator; our simulator achieved a 78.3% reduction in modeling time.
- Asia > Singapore (0.04)
- North America > United States > North Dakota > Burke County (0.04)
- North America > United States > Michigan > Wayne County > Detroit (0.04)
- (5 more...)
- Research Report (0.64)
- Workflow (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Modeling & Simulation (0.94)